A Federated Learning and Deep Reinforcement Learning-Based Method with Two Types of Agents for Computation Offload

With the rise of latency-sensitive and computationally intensive applications in mobile edge computing (MEC) environments, the computation offloading strategy has been widely studied to meet the low-latency demands of these applications. However, the uncertainty of various tasks and the time-varying conditions of wireless networks make it difficult for mobile devices to make efficient decisions. Existing methods also suffer from long decision delays and the risk of disclosing private user data. In this paper, we present FDRT, a federated learning and deep reinforcement learning-based method with two types of agents for computation offload, to minimize the system latency. FDRT uses a multi-agent collaborative computation offloading strategy, namely, DRT. DRT divides the offloading decision into whether to compute tasks locally and whether to offload tasks to MEC servers. The designed DDQN agent considers the task information, its own resources, and the network status conditions of mobile devices, and the designed D3QN agent considers these conditions of all MEC servers in the collaborative cloud-edge-end MEC system; both jointly learn the optimal decision. FDRT also applies federated learning to reduce communication overhead and optimize the model training of DRT by designing a new parameter aggregation method, while protecting user data privacy. The simulation results showed that DRT effectively reduced the average task execution delay by up to 50% compared with several baselines and state-of-the-art offloading strategies. FDRT also accelerates the convergence rate of multi-agent training and reduces the training time of DRT by 61.7%.


Introduction
In recent years, with the development of mobile smart devices and 5G, many computationally intensive applications with low latency requirements have emerged, such as autonomous driving [1], virtual reality and augmented reality [2], online interactive gaming [3], and video streaming analysis [4]. These applications all have high requirements for quality of service (QoS). However, mobile devices (MDs) have limited computing power and are challenged by the growing demands for application computing power and increasingly stringent latency requirements.
To overcome this challenge, the mobile edge computing (MEC) [5] paradigm, a core technology of 5G, pushes computing resources to the network edge, much closer to the MDs, thus relieving the network congestion and task delay of traditional centralized cloud computing. Unlike traditional cloud servers, MEC servers are not very resource-rich. Therefore, MDs need an efficient computation offloading strategy to determine whether to offload the tasks they generate to MEC servers or the cloud server for execution, so as to fully utilize the computational resources, meet the quality of experience (QoE) of MDs, and minimize the task execution latency. However, the uncertainty of computing tasks and the time-varying nature of wireless channels make it difficult to reach accurate and appropriate computation offloading decisions.
Reinforcement learning (RL) is a method for learning "what to do (i.e., how to map the current environment into an action) to maximize the numerical revenue signal" [6]. With the rise of artificial intelligence, deep reinforcement learning (DRL), which combines RL and deep learning, is considered an effective method for finding asymptotically optimal solutions in time-varying edge environments [7]. Without any prior knowledge, DRL can capture the hidden dynamics of the environment well by enhancing the intelligence of the edge network, so as to learn strategies and achieve optimal long-term goals through context-specific repeated interactions. Such a property gives DRL unique potential for designing computation offloading strategies in dynamic MEC systems. In the conventional DRL method, the user data are transmitted to a central server for model training, and the agent deployed on MDs learns the strategy. Such centralized DRL not only puts pressure on the wireless network but also leads to the risk of user data leakage. Nevertheless, the existing DRL-based methods [8-10] rarely consider the issue of data privacy protection.
To address these problems, this paper proposes a federated learning (FL) and DRL-based method with two types of agents for computation offload, named FDRT. FL [11] decouples the model from the user data and aggregates the model according to the uploaded local model parameters, thus achieving a balance between data privacy protection and data sharing. Applying FL to the MEC environment can realize decentralized distributed training and accelerate the parameter transmission and training speed of DRL agents. FDRT uses two types of DRL agents to better explore other MEC server resources for making the optimal offloading decision, thus achieving the goal of minimizing the task execution delay of each MD. The major contributions of this paper are summarized as follows.

• We proposed a multi-agent collaborative computation offloading strategy, named DRT. By building a collaborative cloud-edge-end MEC system, DRT was used to design a double deep Q-network (DDQN)-based mobile device agent and a dueling DDQN (D3QN)-based MEC server agent, which decided, respectively, whether to compute a task locally on the MD and whether to offload it to an MEC server. DRT enabled the offloading strategy to consider task information, network status, and nearby MEC server resources to ensure the optimal decision for minimizing the execution delay of tasks.

• We proposed an FL-based multi-agent training method, FDRT, to optimize the training of DRT. In FDRT, the MEC server aggregated the network parameters of MDs within its coverage to obtain the semi-global model, and the global model was aggregated from the semi-global models of nearby MEC servers, which reduced the parameter transmission of traditional FL training and the network overhead and thus improved the system's QoS. Meanwhile, FDRT enabled the privacy protection of user data.

• We conducted two sets of simulation experiments. The experimental results showed that the proposed DRT achieved significant performance improvements over several baseline and state-of-the-art computation offloading strategies, reducing the average task execution delay by up to 50%. We also demonstrated the effectiveness of FDRT, which reduced the training time of DRT by 61.7% and the average task execution delay by 2.8%.
minimize the long-term energy consumption. Zhang et al. [10] use the reconfigurable intelligent surface (RIS) technique to adjust the phase shift and amplitude of reflective elements to improve the wireless link status and energy efficiency, and also use the DDPG algorithm for the computation offloading strategy. Li et al. [28] also design an offloading strategy based on a DDPG approach in which all computing tasks sequentially decide the computation location through an agent. Koo et al. [29] adopt a Q-learning algorithm to find a task offloading strategy through device-to-device communication in the MEC environment. They cluster nearby devices and let the head device of each cluster make the decision, which reduces the computing complexity of agents. Under the power constraints of IoT devices and wireless charging, Wei et al. [30] propose agents based on post-state learning algorithms, which are deployed on MDs for decision making.
In cyber-twin networks, Hou et al. [31] achieve fast task processing, dynamic real-time task allocation, and low training overhead by using a multi-agent deep deterministic policy gradient approach. There are several main drawbacks of the machine-learning-based methods. First, most of the existing methods use single-type DRL agents to accomplish the computation offloading decisions. They mainly consider the resources of MDs, connected MEC servers, and cloud servers but ignore the resources of other MEC servers in the MEC environment. In addition, it is difficult for MDs to access the resource information of MEC servers and cloud server, and accessing this information also causes decision latency. Second, in traditional machine learning architectures for MEC environment, data producers must frequently send and share data with third parties (e.g., MEC servers or cloud servers) to train their models. This not only causes the risk of user data privacy disclosure but also has a high demand for network bandwidth. In the time-varying wireless network environment, high communication overhead will undoubtedly have a negative impact on the decision delay.
To address these problems, we design two types of agents, a DDQN-based mobile device agent and a D3QN-based MEC server agent, to make the optimal offloading decision by jointly considering the resources of neighboring MEC servers. We apply FL to the multi-agent collaborative training. FL can effectively protect user data privacy by keeping the data localized while only transmitting local model parameters for global model training. It has been applied in multiple domains in the MEC environment, including privacy protection and communication optimization for large-scale model training [32,33], content caching [34], malware and anomaly detection [35], task scheduling and resource allocation [36,37], computation offloading [38], etc. However, in most MEC federated learning systems, the parameters of all MDs are transmitted to an MEC server or the central cloud, which not only causes a high network overhead and occupies network resources in the core network or edge layer but also increases the time users wait for the global model parameters, therefore reducing the system's QoS. We also optimize the FL training efficiency by building a decentralized federated learning system model. Each MEC server and MD deploys an MEC server agent and a mobile device agent, respectively. Each MEC server additionally hosts a mobile device agent model that serves as the semi-global model of all mobile devices within its base station's range. The MEC server aggregates the semi-global models from the nearby MEC servers to generate the global model, which reduces the parameter transmission of traditional FL training and the network overhead, especially compared with a central cloud-based federated learning model. Table 1 summarizes the comparison between our work and recent representative related work.
For the complexity, N denotes the number of devices in the network, T denotes the number of episodes for model training, and H denotes the number of steps per episode.
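The two-level aggregation described above (MD models into a semi-global model per MEC server, then semi-global models of nearby servers into a global model) can be sketched as follows. This is only an illustration under stated assumptions: the paper's own aggregation rule is not reproduced in this section, so plain (optionally weighted) parameter averaging is assumed, and the names `average_params`, `semi_global_model`, and `global_model` are hypothetical.

```python
import numpy as np


def average_params(models, weights=None):
    """Weighted average of parameter dicts that share the same keys and shapes.

    Assumption: simple averaging stands in for the paper's aggregation method.
    """
    if weights is None:
        weights = [1.0] * len(models)
    total = float(sum(weights))
    return {
        key: sum(w * m[key] for w, m in zip(weights, models)) / total
        for key in models[0]
    }


def semi_global_model(md_models, md_weights=None):
    """Aggregate the mobile device agents covered by one MEC server."""
    return average_params(md_models, md_weights)


def global_model(own_semi_global, neighbor_semi_globals):
    """Aggregate a server's semi-global model with those of nearby servers."""
    return average_params([own_semi_global, *neighbor_semi_globals])


# Two MDs under one MEC server, plus one neighboring MEC server.
md_a = {"w": np.array([1.0, 2.0])}
md_b = {"w": np.array([3.0, 4.0])}
semi_1 = semi_global_model([md_a, md_b])
semi_2 = {"w": np.array([4.0, 5.0])}
print(global_model(semi_1, [semi_2])["w"])
```

Only model parameters cross the network in this scheme, and each MD talks to its own MEC server rather than to a central cloud, which is the source of the reduced transmission overhead claimed above.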

Network Model
In this paper, we considered the MEC system model with cloud-edge-end collaboration, as shown in Figure 1. In the model, the scenario was seamlessly covered by M base stations (BSs) that provided computation offloading services to N user MDs distributed within their range through a 5G wireless communication network, where the BSs were connected to each other via optical fiber. Each BS was equipped with an MEC server to provide computing power, so that the BS could compute a variety of tasks to meet the user's needs. The user's MD generated computing tasks that could be computed on the device or offloaded to an MEC server for execution. In addition, the BSs were connected to the core network via high-speed optical fiber, which in turn exchanged data with the central cloud, so that the MD could also offload tasks to the central cloud through the BS. The MDs were denoted as a set $U = \{u_1, u_2, \ldots, u_n, \ldots, u_N\}$, and the BSs were denoted as a set $E = \{e_1, e_2, \ldots, e_m, \ldots, e_M\}$.
To facilitate the subsequent modeling, a time slot model was used to discretize the time into equal time intervals. The length of the time slot was denoted as $T_{len}$, and the index was denoted as $t = 0, 1, 2, \ldots, T_{len}$. The size of $T_{len}$ was set as the coherence time. The channel could be kept constant during the coherence time. In communication systems, the communication channel may change with time. This channel variation is more significant in wireless communication systems due to the Doppler effect. Table 2 presents the main notations that are used in this paper.

Symbols | Definition
$h_{n,m}(t)$ | Wireless channel gain between mobile device $u_n$ and MEC server $e_m$ in the tth time slot
$v_{n,m}(t)$ | Transmission rate between $u_n$ and $e_m$ in the tth time slot
$d_n(t)$ | Computing task generated by $u_n$ in the tth time slot
$s_n(t)$ | Size of the data to be computed in $d_n(t)$
$c_n(t)$ | Number of CPU cycles required to process 1 bit of data of $d_n(t)$
$q^U_n(t)$ | Task queue of $u_n$
$b^U_n(t)$ | Number of tasks in $q^U_n(t)$
$c^U_n(t)$ | Total number of CPU cycles required to compute the tasks of $q^U_n(t)$
$q^E_m(t)$ | Task queue of $e_m$
$b^E_m(t)$ | Number of tasks in $q^E_m(t)$
$c^E_m(t)$ | Total number of CPU cycles required to compute the tasks of $q^E_m(t)$
$l^L_n(t)$ | Execution delay of $d_n(t)$ on $u_n$
$l^E_n(t)$ | Execution delay of $d_n(t)$ on $e_m$
$l^{E'}_n(t)$ | Execution delay of $d_n(t)$ on a nearby MEC server $e_{m'}$
$l^C_n(t)$ | Execution delay of $d_n(t)$ on the central cloud
$s^U_n(t)$ | State of $u_n$ in the tth time slot
$a^U_n(t)$ | Action of $u_n$ in the tth time slot
$s^E_{n,m}(t)$ | State of $e_m$ for $d_n(t)$ in the tth time slot
$a^E_{n,m}(t)$ | Action of $e_m$ for $d_n(t)$ in the tth time slot

Communication Model
In the MEC system network with cloud-edge-end collaboration, two main communication methods were included, i.e., edge-to-end wireless communication, and edge-to-edge and edge-to-cloud wired communication. The transmission rate between BSs $e_m$ and $e_{m'}$ was denoted as $v^E_{m,m'}$, and the transmission rate between the BS $e_m$ and the central cloud was denoted as $v^{E,C}_m$. These two wired transmission rates both obeyed stable and independent random processes with probability distribution functions $fun_E(v^E)$ and $fun_{E,C}(v^{E,C})$, respectively. Next, the wireless communication methods between MDs and BSs are described and modeled in detail.
In order to minimize the execution delay of the computing tasks for MDs, we constructed a wireless communication model based on 5G technology. MDs may need to offload computation and transmit data to the BS in every time slot; therefore, the orthogonal frequency division multiplexing (OFDM) technique was used to assign different sub-channels to different devices, reducing the mutual interference between sub-channels and meeting the devices' transmission requirements.
In this paper, a Rayleigh fading model was constructed based on the free-space path loss model to simulate the channel in a dense building scenario, which is very common in cities. In this scenario, there was no direct path between the transmitter and receiver, and the signal was attenuated, reflected, refracted, and diffracted by buildings or other objects. The channel remained stable within a time slot but changed from time slot to time slot. The channel gain $h_{n,m}(t)$ of the wireless channel is calculated by Equation (1) according to [26], where $\beta_{n,m}(t)$ denotes the Rayleigh-distributed channel fading factor between $u_n$ and $e_m$ in the tth time slot, with probability distribution function $fun_B(\beta)$; $A_d$ denotes the antenna gain of the BS connected to $e_m$; $c_0$ denotes the speed of light in vacuum; $f_c$ denotes the carrier frequency of the BS connected to $e_m$; $d_{n,m}(t)$ denotes the distance between $u_n$ and $e_m$ in the tth time slot; and $d_e$ denotes the path loss exponent.
Since a BS and its MEC server were deployed close to each other, the transmission delay between them was negligible. According to the Shannon equation $C = B \log_2(1 + S/N)$, combined with Equation (1), the transmission rate $v_{n,m}(t)$ between $u_n$ and $e_m$ in the tth time slot can be calculated by Equation (2),

$$v_{n,m}(t) = B \log_2\left(1 + \frac{p_0 h_{n,m}(t)}{N_0}\right), \quad (2)$$

where $B$ denotes the channel bandwidth between $u_n$ and $e_m$, $p_0$ denotes the transmission power of $u_n$, and $N_0$ denotes the Gaussian white noise power. Since the OFDM sub-channels did not interfere with each other, the noise in the channel was only Gaussian white noise.
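The channel and rate model can be illustrated with a short sketch. Equation (1) is not reproduced in this text, so the gain below assumes the common free-space form $h = \beta A_d \left(c_0 / (4\pi f_c d)\right)^{d_e}$ built from the symbols defined above; the rate applies the Shannon formula with signal power $p_0 h$ and noise power $N_0$. The function names and all numerical values are hypothetical.

```python
import math

C0 = 3.0e8  # speed of light in vacuum, m/s


def channel_gain(beta, antenna_gain, carrier_hz, distance_m, path_loss_exp):
    """Assumed form of Equation (1): fading factor times free-space path loss."""
    path_term = C0 / (4.0 * math.pi * carrier_hz * distance_m)
    return beta * antenna_gain * path_term ** path_loss_exp


def transmission_rate(bandwidth_hz, tx_power_w, gain, noise_power_w):
    """Shannon rate B * log2(1 + p0 * h / N0), in bits per second."""
    return bandwidth_hz * math.log2(1.0 + tx_power_w * gain / noise_power_w)


# Illustrative values only; none of them come from the paper.
h = channel_gain(beta=1.0, antenna_gain=4.0, carrier_hz=3.5e9,
                 distance_m=100.0, path_loss_exp=3.0)
v = transmission_rate(bandwidth_hz=1.0e6, tx_power_w=0.5,
                      gain=h, noise_power_w=1.0e-13)
print(f"gain={h:.3e}, rate={v / 1e6:.2f} Mbit/s")
```

In a simulation, `beta` would be redrawn from a Rayleigh distribution at every time slot while staying fixed within a slot, matching the coherence-time assumption of the time slot model.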

Task Model
At the beginning of the tth time slot, the N MDs generate N indivisible tasks at the same time, and these tasks are independent. If no data were generated at the beginning of the tth time slot, $s_n(t) = 0$. Since the data were actually generated by $u_n$ in the (t − 1)th time slot and processed from the tth time slot onward, they could be considered to be generated at the beginning of the tth time slot for the convenience of model representation. The computing task generated by $u_n$ at the beginning of the tth time slot was denoted by $d_n(t) := \{s_n(t), c_n(t)\}$.

Computation Model
The computing power of the central cloud is thousands of times greater than that of MDs and MEC servers. Therefore, the CPU frequency of the central cloud was set to infinity, and the CPU frequencies of the MD and MEC servers were set to fixed values, denoted as $f^U$ and $f^E$, respectively, where $f^E \gg f^U$. The computing power of the MD and MEC servers was relatively limited. If a task was assigned to an MD or an MEC server, other tasks may already have been running, so the task could not be executed immediately. Therefore, we set task queues on both MDs and MEC servers for the queuing tasks, where the "first in, first out" principle was applied to the queues.
The computing tasks of an MD could be executed locally on the device, offloaded to the MEC server of its BS or the MEC server of its neighboring BS, or offloaded to the central cloud for execution. Different computing modes led to different latencies, and the four computing modes were analyzed as follows.
(1) Local Computing. In the tth time slot, $u_n$'s task queue is denoted as $q^U_n(t) := \{b^U_n(t), c^U_n(t)\}$. When $t = 0$, $q^U_n(0) = \{0, 0\}$, i.e., there is no task in the queue. If the computing task $d_n(t)$ of $u_n$ is executed locally, the task will be added to the local queue, and its execution delay $l^L_n(t)$ is calculated by Equation (3) according to [26].
(2) MEC Computing. In the tth time slot, $e_m$'s task queue is denoted as $q^E_m(t) := \{b^E_m(t), c^E_m(t)\}$. If $d_n(t)$ is executed on $e_m$, the task will be added to the task queue of the MEC server, and its execution delay $l^E_n(t)$ is calculated by Equation (4) according to [26]. Since the resulting data of the task are usually very small, the transmission delay of the resulting data can be neglected.
(3) Near MEC Computing. If $d_n(t)$ is executed on one of its neighboring BSs $e_{m'}$, the task will be added to the task queue of the MEC server $e_{m'}$, and the task execution delay $l^{E'}_n(t)$ can be calculated by Equation (5) according to [26].
(4) Cloud Computing. If $d_n(t)$ is executed in the central cloud, its execution delay $l^C_n(t)$ can be calculated by Equation (6) according to [26]. Since the computing power of the central cloud is overwhelmingly strong, the computational time of the task in the central cloud can be ignored.
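Equations (3)-(6) are not reproduced in this text, so the following sketch shows one plausible reading of the four computing modes rather than the paper's exact formulas. It assumes that a task waits for the queued CPU cycles ahead of it (first in, first out), that offloading adds an upload time of data size divided by link rate, that result download is neglected, and that cloud computation time is zero. All function and parameter names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Task:
    size_bits: float       # s_n(t): data size of the task
    cycles_per_bit: float  # c_n(t): CPU cycles needed per bit

    @property
    def cycles(self) -> float:
        """Total CPU cycles the task needs."""
        return self.size_bits * self.cycles_per_bit


def local_delay(task, queued_cycles, f_md):
    """Wait for the local FIFO queue, then compute on the MD."""
    return (queued_cycles + task.cycles) / f_md


def mec_delay(task, uplink_rate, queued_cycles, f_mec):
    """Upload to the connected MEC server, wait in its queue, then compute."""
    return task.size_bits / uplink_rate + (queued_cycles + task.cycles) / f_mec


def near_mec_delay(task, uplink_rate, inter_bs_rate, queued_cycles, f_mec):
    """Upload, forward over the wired BS-to-BS link, then queue and compute."""
    return (task.size_bits / uplink_rate
            + task.size_bits / inter_bs_rate
            + (queued_cycles + task.cycles) / f_mec)


def cloud_delay(task, uplink_rate, bs_cloud_rate):
    """Upload and forward to the cloud; cloud compute time is ignored."""
    return task.size_bits / uplink_rate + task.size_bits / bs_cloud_rate


# Illustrative values only.
task = Task(size_bits=1.0e6, cycles_per_bit=100.0)
print(local_delay(task, queued_cycles=0.0, f_md=1.0e9))
print(mec_delay(task, uplink_rate=1.0e7, queued_cycles=0.0, f_mec=1.0e10))
print(near_mec_delay(task, 1.0e7, 1.0e8, queued_cycles=0.0, f_mec=1.0e10))
print(cloud_delay(task, uplink_rate=1.0e7, bs_cloud_rate=5.0e7))
```

Under these assumptions the trade-off is visible directly: local execution avoids transmission but runs on a slow CPU, whereas the offloaded modes pay for the uplink and, for the neighboring server and the cloud, an additional wired hop.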

Problem Statement
Based on the established system model, we first formalized the optimization problem of minimizing the long-term computational latency of tasks for the entire MEC system with limited resource constraints. The execution location of the computing task $d_n(t)$ in the tth time slot of $u_n$ is denoted as $a_n(t)$. When $a_n(t) = 0$, it means that $d_n(t)$ is computed locally; when $a_n(t) = 1$, it means that the task is executed on the MEC server of its connected BS; when $a_n(t) = 2$, it means that the task is executed on the MEC server of its neighboring BS; and when $a_n(t) = 3$, it means that the task is executed on the central cloud server. Jointly with Equations (3)-(6), the execution delay of the computing task $d_n(t)$ is calculated by Equation (7),

$$l_n(t) = \mathbb{1}(a_n(t) = 0)\, l^L_n(t) + \mathbb{1}(a_n(t) = 1)\, l^E_n(t) + \mathbb{1}(a_n(t) = 2)\, l^{E'}_n(t) + \mathbb{1}(a_n(t) = 3)\, l^C_n(t), \quad (7)$$

where $\mathbb{1}(\cdot)$ is the indicator function. When the condition in the function bracket is true, the value of the function is 1; otherwise, it is 0. Then, the optimization problem can be expressed as Equation (8), where … of the BS and another BS continues to provide service when the MD exceeds the maximum service distance of the current BS.
This optimization problem is an NP-hard problem. To effectively solve this problem, we proposed a DRL-based strategy with two types of agents for computation offload, i.e., DRT. DRT first divided the problem into two sub-problems, i.e., whether the MD executes the computing task locally and, if not, which of the three computing modes (MEC computing, near-MEC computing, or cloud computing) the task should use. We approximately modeled these two sub-problems as Markov decision processes (MDPs) and designed two types of DRL agents.
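The indicator-based delay of Equation (7) and the two-stage split of the decision can be sketched as follows. The encoding of $a_n(t) \in \{0, 1, 2, 3\}$ follows the text; the assumption that the MEC server agent returns the final location directly as 1, 2, or 3 is made only for illustration, and both function names are hypothetical.

```python
def execution_delay(action, l_local, l_mec, l_near, l_cloud):
    """Equation (7): sum of indicator-weighted delays for a_n(t) in {0, 1, 2, 3}."""
    delays = (l_local, l_mec, l_near, l_cloud)
    if action not in range(4):
        raise ValueError("action must be 0, 1, 2, or 3")
    # float(action == k) plays the role of the indicator function.
    return sum(float(action == k) * delays[k] for k in range(4))


def combine_decisions(md_action, mec_action=None):
    """Merge the two agents' choices into one execution location a_n(t).

    md_action: 0 = compute locally, 1 = offload (mobile device agent).
    mec_action: assumed to be 1 (connected MEC), 2 (nearby MEC), or 3 (cloud),
    chosen by the MEC server agent only when the task is offloaded.
    """
    if md_action == 0:
        return 0
    if mec_action not in (1, 2, 3):
        raise ValueError("offloaded tasks need an MEC server decision in {1, 2, 3}")
    return mec_action


a = combine_decisions(md_action=1, mec_action=2)
print(a, execution_delay(a, l_local=0.10, l_mec=0.11, l_near=0.12, l_cloud=0.13))
```

Splitting the choice this way keeps the on-device agent's action space binary, while the larger three-way choice is made where server-side resource information is available.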

DDQN-Based Mobile Device Agent
We designed a DDQN-based mobile device agent, which was deployed on the MD to decide whether the MD executed computing tasks locally. The agent could make offloading decisions based only on the current computing task information, its task queues, and the network transmission rate between the MD and its connected MEC server. Since the agent was deployed on the MD, it could easily obtain such information. The three key elements of using the MDP to model the agent, including state space, action space, and reward function, are described as follows.

State
Due to the limited computing power of MDs, it was not suitable to deploy a complex agent on the MD. We minimized the state space and defined the state $s^U_n(t) \in S^U$ of the mobile device agent in the tth time slot, as shown in Equation (9). The state includes the size of the data generated by a task, the number of CPU cycles required to process 1 bit of the data, the number of tasks in the local queue, the total number of CPU cycles needed to compute the tasks in the local queue, and the transmission rate between the MD and its connected MEC server.

Action
The goal of the mobile device agent was to choose the optimal action to minimize the task execution delay based on the current state. The agent was responsible for deciding, at each time slot, whether to execute the computing task locally. The action of the agent is represented by $a^U_n(t)$, which is defined by Equation (10), where $A^U$ denotes the action set of the agent. When $a^U_n(t) = 0$, the task is executed locally; when $a^U_n(t) = 1$, the task is offloaded, and the subsequent decision is made by the MEC server agent.

Reward
In general, the reward function is tied to the optimization goal, which here was to minimize the long-term computation latency of all tasks in the entire MEC system. Since RL maximizes a cumulative numerical reward, the value of the reward function was made negatively correlated with the value of the optimization objective. The value of the reward function, $r^U(s^U_n(t), a^U_n(t)) \in R^U$, where $R^U$ is the reward space, for the MD agent taking action $a^U_n(t)$ in state $s^U_n(t)$ in the tth time slot is calculated by Equation (11) according to [39], where $\mathrm{time}_{n,m}(t)$ indicates the subsequent execution time of the task $d_n(t)$ when it is offloaded for execution, as described in Section 4.3.3.

DDQN Model Training
We used the DDQN model [39] to train the mobile device agent to obtain the optimal offloading decision, which could reduce the high computational complexity caused by the state space explosion in the MDP. The DDQN model decoupled action selection from Q-value evaluation by maintaining two networks, and it was trained with an experience replay method; here, the Q value is the estimated long-term return of taking an action in a state. During the training process, the agent maintained an experience replay pool and saved the transition tuple (s, a, r, s') of each time slot in it. The elements of the tuple were the current state, the action, the reward obtained by taking the action in the current state, and the resulting next state, respectively. The agent randomly selected a mini-batch of samples from the experience replay pool to update the parameters of the network, i.e., it learned from randomly chosen past experience. The pool was updated on a "first in, first out" basis. The agent maintained the current network $Q(s, a; \theta)$ and the target network $\hat{Q}(s, a; \hat{\theta})$, where $Q$ is used to select actions, $\hat{Q}$ is used to evaluate the value of the selected action, and $\theta$ and $\hat{\theta}$ represent the parameters of $Q$ and $\hat{Q}$. The target network regularly updated its parameters by copying those of the current network. The experience replay method helped accelerate training convergence.
Figure 2 shows the training process of the DDQN-based mobile device agent. In each episode, we first initialized the system model according to the task information, the task queue information, and the wireless network transmission rate, together with the parameter $\theta$ of the current network $Q$, the parameter $\hat{\theta} = \theta$ of the target network $\hat{Q}$, and the experience replay pool M. The model started iterative training once the initial state was obtained. In each time slot, the "ε-greedy" strategy was used to select actions: the agent drew a random value in [0, 1); if the value was greater than the preset exploration rate ε, it selected a random action, and otherwise it selected the action with the maximum Q value output by the current network. Then, the model executed the action, obtained the corresponding reward and the next state, and stored the transition in M. The agent sampled a batch from M to calculate the gradient of the loss function and used gradient descent with backpropagation to minimize the loss. Finally, the current network converged to the optimal action-value function through continuous iteration of the above steps.
With the ε-greedy strategy, the DDQN model ensured that the agent could still explore other actions after the network had roughly converged, which prevented it from falling into a local optimum.
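The selection and evaluation steps described above can be illustrated with a minimal Python sketch. It follows the convention stated in the text (a random action is taken when the drawn value exceeds ε) and the usual Double DQN target, in which the current network picks the next action and the target network evaluates it. The function names, the discount factor `gamma`, and the `done` flag are illustrative assumptions and are not taken from the paper.

```python
import random
from collections import deque


def make_replay_pool(capacity):
    """FIFO experience replay pool: the oldest transition is dropped when full."""
    return deque(maxlen=capacity)


def select_action(q_values, epsilon, rng=random):
    """Draw x in [0, 1); act randomly if x > epsilon, otherwise act greedily.

    This mirrors the rule stated in the text, not the more common
    'explore with probability epsilon' convention.
    """
    x = rng.random()
    if x > epsilon:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])


def ddqn_target(reward, q_next_current, q_next_target, gamma, done=False):
    """Double DQN target value for one transition.

    q_next_current: Q(s', .) from the current network (used to select the action).
    q_next_target:  Q_hat(s', .) from the target network (used to evaluate it).
    gamma and done are assumed names for the discount factor and episode end.
    """
    if done:
        return reward
    best = max(range(len(q_next_current)), key=lambda a: q_next_current[a])
    return reward + gamma * q_next_target[best]
```

In an actual training loop, the squared difference between this target and the current network's Q value for the taken action would serve as the loss to be minimized.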

D3QN-Based MEC Server Agent
The dueling DDQN (D3QN) model [40] has advantages in generalizing learning across actions without imposing any change to the underlying algorithm. This feature makes D3QN well suited to the MEC server agent, which has a large action space. Therefore, we designed the D3QN-based MEC server agent, which was deployed on the MEC server to make task offloading decisions, i.e., to decide whether to compute the task on the MEC server, offload it to a neighboring MEC server, or offload it to the central cloud for execution. Similar to the mobile device agent, we used the MDP to model the MEC server agent.

State
The state $s^E_{n,m}(t) \in S^E$ of the MEC server agent $e_m$ at the arrival of the computing task from $u_n$ is defined as Equation (12), where $M$ indicates the number of MEC servers neighboring $e_m$. The agent made offloading decisions based on the offloaded tasks from MDs, the task queue status of its own MEC server, and the task queue status of its neighboring MEC servers. Since the server agent was deployed on the MEC server of the BS, and the BSs of telecom operators were mutually trusted, the agent could easily access the task queue status of its neighboring MEC servers.

Action
The goal of the MEC server agent was to decide the execution location of the offloaded task, i.e., to choose the optimal action to minimize the task execution delay based on the current state. The MEC server agent was responsible for deciding the computing mode of the task. The action for task $d_{n,m}(t)$ is defined by Equation (13), where $A^E$ denotes the action set of the server agent. When $a^E_{n,m}(t) = 0$, the task is executed on the local MEC server; $a^E_{n,m}(t) = 1$ means that the task is offloaded to the central cloud for execution; and $a^E_{n,m}(t) = 2, \ldots, M + 1$ means that the task is offloaded to a neighboring MEC server for execution, where the numerical value identifies the neighboring server.
$$a^E_{n,m}(t) \in A^E := \{0, 1, 2, \ldots, M + 1\}. \tag{13}$$

Reward
Similar to the MD agent, the value of the reward function, $r^E(s^E_{n,m}(t), a^E_{n,m}(t)) \in R^E$, for the server agent taking action $a^E_{n,m}(t)$ in state $s^E_{n,m}(t)$ in the tth time slot, is calculated by Equation (14) according to [40]. The subsequent execution time of the task, $\mathrm{time}_{n,m}(t)$, is given by Equation (15) according to [40] for different cases, i.e., when the task is executed on the local MEC server ($a^E_{n,m}(t) = 0$), offloaded to the central cloud for execution ($a^E_{n,m}(t) = 1$), or offloaded to a neighboring MEC server for execution ($a^E_{n,m}(t) = 2, \ldots, M + 1$). It is related to the task information, the task queue information, and the wireless network transmission rate.
$$r^E\big(s^E_{n,m}(t), a^E_{n,m}(t)\big) = -\mathrm{time}_{n,m}(t). \tag{14}$$

D3QN Model Training
The MEC server agent had a larger and more complex state space and action space than the MD agent, especially the action space. When the action space was very large, D3QN performed better than traditional DRL networks [40]. Therefore, we used the D3QN model to train the MEC server agent, so as to avoid slow convergence of the MEC server agent and the resulting inability to make timely offloading decisions. The training process of the D3QN model was similar to that of the DDQN model. The current network and the target network of the D3QN model had the same functions as those of the DDQN model, but their structures were slightly different: each had two output branches whose outputs were aggregated to produce the Q value of each action. Due to space limitations, we do not describe the structure in detail; it can be found in [40].
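As a rough illustration of how two output branches can be aggregated into per-action Q values, the following Python sketch uses the mean-subtracted combination commonly used in dueling architectures. The source defers the exact structure to [40], so this aggregator, the function name, and the argument names are assumptions rather than the authors' implementation.

```python
def dueling_q_values(state_value, advantages):
    """Combine a scalar state value V(s) with per-action advantages A(s, a).

    Uses Q(s, a) = V(s) + A(s, a) - mean_a A(s, a), the mean-subtracted form
    commonly used in dueling networks (assumed here, not confirmed by the text).
    """
    if not advantages:
        raise ValueError("at least one action is required")
    mean_adv = sum(advantages) / len(advantages)
    return [state_value + adv - mean_adv for adv in advantages]
```

For the MEC server agent, the list of advantages would have one entry per action in the set {0, 1, ..., M + 1}.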

DRT
Based on these two types of agents and Equation (8), DRT obtained the optimal computation offload strategy for the MEC system through multi-agent cooperative training. The training process of DRT with multiple episodes is described as follows.
First, the system model and the parameters of each agent network were initialized before multi-episode iterative training. In a single episode, each mobile device agent obtained the current initial state and started training from that state. If the mobile device agent offloaded a computing task to its connected MEC server, the MEC server agent started a single-episode training. Finally, through continuous iterative training over multiple episodes, the current network of each mobile device agent and each MEC server agent converged toward the optimal action-value function; that is, the optimal offloading strategy for all tasks in the entire MEC system was learned, which minimized the computation delay of all tasks in the entire MEC system.
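The two-stage decision that results from this cooperation can be sketched as follows. The action encodings follow the text (0 for local execution and 1 for offloading on the MD side; 0, 1, and 2 to M + 1 on the server side), while the function name, the returned labels, and the zero-based neighbor index are hypothetical choices made only for illustration.

```python
def decide_execution_location(md_action, server_action=None, num_neighbors=0):
    """Resolve where a task runs from the two agents' actions.

    md_action:     0 = execute on the MD, 1 = offload to the connected MEC server.
    server_action: 0 = connected MEC server, 1 = central cloud,
                   2..num_neighbors+1 = a neighboring MEC server.
    Returns a (location, neighbor_index) pair; neighbor_index is zero-based
    (an assumption) and is None unless a neighbor is chosen.
    """
    if md_action == 0:
        return ("local", None)
    if md_action != 1:
        raise ValueError("md_action must be 0 or 1")
    if server_action is None:
        raise ValueError("an offloaded task needs a server_action")
    if server_action == 0:
        return ("mec", None)
    if server_action == 1:
        return ("cloud", None)
    if 2 <= server_action <= num_neighbors + 1:
        return ("neighbor", server_action - 2)
    raise ValueError("server_action outside the action set")
```

The server agent is consulted only when the MD agent offloads, which is why each agent sees a smaller state and action space than a single monolithic agent would.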

Deployment of Two Types of Agents
Although DRT could dynamically and efficiently find the optimal computation offloading strategy, it also required a lot of computing resources to train agents. Therefore, the deployment of the two types of agents should be carefully considered in order to effectively use the computing resources in the MEC systems.
As described in Section 4, the MEC server agent is suitable to be deployed on the MEC server. For the mobile device agent, training it using the traditional RL training method on the MD would introduce two shortcomings. First, additional energy will be wasted by training separate agents for each MD. Second, MDs have limited computing power, and the cost of training the agent from scratch is too high. There are also several drawbacks if the mobile device agent is trained on the MEC server that the MD is connected to. (a) There will be communication overhead between the MD and the MEC server, which causes delays in making offloading decisions and makes it difficult to make decisions in real time. (b) If only the connected MEC server maintains the mobile device agent, it involves the migration of agent data when users move to another BS's coverage. If all MEC servers maintain agents for each MD, it will not only result in resource waste but also bring in extra synchronization of agent data. (c) The privacy of MDs may be compromised as the uploaded training data may be privacy-sensitive, especially in industrial information scenarios. (d) Although the training data can be transformed to protect privacy, the data received by the MEC server would lose some relevancies compared with the source data, making it difficult to optimize the offloading decision or even making a worse offloading decision. (e) A large amount of training data is always transmitted from MDs to the MEC server, which puts a heavy burden on the wireless channel of BSs. Therefore, we deployed the DDQN-based agent on MDs and the D3QN-based agent on the MEC server and proposed the FDRT optimization method for multi-agent training.

FDRT
We applied FL to optimize the multi-agent training of the proposed DRT, and we named the resulting method FDRT. As a cooperative machine learning framework, FL can train models without accessing users' private data, and it achieves decentralized distributed training through its model aggregation mechanism. DRT needs to interact with the environment frequently for multi-agent collaborative training, which generates a large amount of network communication, increases the computing pressure on resource-constrained mobile devices, and thus increases the latency of computation offload decisions. When FL is combined with DRT, owing to the decentralized topology of FL, the mobile device agent only needs to communicate with its connected MEC server. The MEC server aggregates the local model parameters of the MDs within its coverage and transmits the aggregation results back to them. This efficient decentralized training model can greatly reduce network overhead, improve users' QoS, and reduce the training time of the multi-agent model.
In the federated learning system model for multi-agent training, each MEC server deployed an MEC server agent, and each mobile device deployed a mobile device agent. Each MEC server also deployed an additional mobile device agent, which was used as the global model of all MDs within the BS of the server. The mobile device agent can be regarded as the client of the FL model, and its connected MEC server can be regarded as the parameter server. For the mobile device agent, the state space is the initial data set of the client of the FL model. The target network of the mobile device agent used the discounted sum of its output $\hat{Q}$ value and the reward as the target value of the action in the current state, and the difference between this target value and the Q value of the current network was used to calculate the local loss function of the client, so as to update the current network parameters. The client updated the neural network parameters based on the local data set and uploaded the trained network parameters to the server. The server aggregated these updated parameters to obtain a global parameter and then transmitted it back to the clients for the next round of local training. Before parameter aggregation, each client performed multiple local training steps and parameter updates in a round of training.

FDRT Training and Semi-Global Aggregation
In the FL model for multi-agent training, the workflow was basically the same for all BSs. The workflow for a single BS with multiple MDs is shown in Figure 3. First, a global model of the mobile device agent was initialized on the MEC server. The initial model parameters may differ between MEC servers. The current network parameters of the model were distributed to all MDs within the coverage area of the BS. Each MD synchronized the received parameters to its local current network and target network. Then, the mobile device agent started local training, and the parameters were transmitted to the MEC server to which the MD was connected after every F training steps in a round. If the device moved into the coverage of another BS during training, it transmitted the parameters of the current network to the newly connected MEC server, rather than back to the previously connected MEC server. The MEC server received the network parameters of all MDs in its coverage area and averaged them to obtain a new network parameter, which was referred to as a semi-global model parameter. Further, the MEC server obtained the semi-global model parameters from its neighboring MEC servers and generated the global model parameter as an average weighted by the number of devices in each coverage area. If a neighboring MEC server did not have any MDs, the semi-global model parameters of this server were not aggregated. Finally, the MEC server transmitted the global model parameters back to the MDs. Each MD synchronized the new parameters to its local current and target networks and then started a new round of training.
In traditional FL, local model parameters of all clients are transferred to a server for global parameter aggregation. In the proposed FDRT, mobile devices transmitted local model parameters to their connected MEC servers to calculate a semi-global parameter, and the MEC server aggregated the semi-global parameters of their neighboring MEC servers to obtain the global parameter. Therefore, compared with traditional FL, FDRT can alleviate the pressure of wireless network communication, reduce the delay of global model parameter transmission, and improve the efficiency of model training.
Algorithm 1 describes the training process of the mobile device agent of the FDRT. The training process of the MEC server agent was similar. The time complexity of FDRT is O(TH), where T is the number of training episodes, and H is the number of steps per episode. The complexity of FDRT mainly came from the training process of the DDQN agent and the D3QN agent, which was consistent with the complexity of other DQN-based methods [41].
Algorithm 1. Training process of the mobile device agent of FDRT.
    for each training episode do
        Get initial state s_0;
        for t = 0 : MAX_STEP by 1 do
            x ← random(0, 1);
            if (x > ε) then
                a_t ← randint(0, 2);
            else
                a_t ← argmax_a Q(s_t, a; θ_t);
            end if
            Perform action a_t in the system model, get reward r(s_t, a_t) and next state s_{t+1};
            s_t ← s_{t+1};
            Put I_t := (s_t, a_t, r(s_t, a_t), s_{t+1}) in M;
            if (memory.is_full()) then
                continue;
            end if
            if ((episode * MAX_STEP + t) mod F == 0) then
                Upload θ_t to the connected MEC server;
                θ_t ← θ_e;
            else
                Randomly choose a batch sample from M to update the network parameters;
            end if
        end for
    end for
The parameter aggregation on MEC server e, where U_e is the set of MDs in its coverage area and E_e is the set of its neighboring MEC servers, proceeds as follows.
    θ_temp^e ← 0;
    for u in U_e do
        θ_temp^e ← θ_temp^e + θ_u;
    end for
    θ_temp^e ← θ_temp^e / |U_e|;
    count ← get_MD_num(e);
    θ_e ← count * θ_temp^e;
    for e' in E_e do
        if (get_MD_num(e') > 0) then
            θ_e ← θ_e + get_MD_num(e') * θ_temp^{e'};
            count ← count + get_MD_num(e');
        end if
    end for
    θ_e ← θ_e / count;
    return θ_e;
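The aggregation steps can be written as a short Python sketch. Model parameters are represented as flat lists of floats purely for illustration; the function names and the (semi-global parameters, MD count) pair used for each neighbor are assumptions, not the authors' code.

```python
def semi_global(md_params):
    """Average the parameter vectors uploaded by the MDs in one coverage area."""
    if not md_params:
        raise ValueError("no MD parameters to average")
    n = len(md_params)
    return [sum(column) / n for column in zip(*md_params)]


def global_aggregate(own_semi, own_count, neighbors):
    """Weight each server's semi-global parameters by its number of MDs.

    neighbors: iterable of (semi_global_params, md_count) pairs, one per
    neighboring MEC server. Neighbors with md_count == 0 are skipped,
    as in the aggregation procedure described in the text.
    """
    total = [own_count * p for p in own_semi]
    count = own_count
    for semi, md_count in neighbors:
        if md_count > 0:
            total = [t + md_count * p for t, p in zip(total, semi)]
            count += md_count
    if count == 0:
        raise ValueError("no MDs in this or any neighboring coverage area")
    return [t / count for t in total]
```

For example, a server whose two MDs upload [1.0, 2.0] and [3.0, 4.0] has the semi-global vector [2.0, 3.0]; combined with one neighbor that serves six MDs and reports [5.0, 6.0], the weighted result is [4.25, 5.25].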

Comparison Strategies
We conducted two sets of simulation experiments to evaluate the performance of the proposed DRT and the FDRT, respectively. To verify the effectiveness of DRT, we compared it with six baseline strategies and two state-of-the-art offloading strategies. To verify the training efficiency of FDRT, we compared it with DRT and a traditional federated learning method (TFL) [11]. As this work aimed to minimize the system latency, we mainly evaluated the reward value and the task execution delay under different strategies. The reward value could not only reflect the system latency but also reflect the convergence trend of the agent.
The baseline strategies are denoted as MDL, MDO, MDR, MSL, MSC, and MSR. MDL represents the strategy that all computing tasks are executed on MDs locally; MDO represents the strategy that all MDs offload their tasks to MEC servers to decide the specific execution location; MDR represents the strategy that all MDs randomly select tasks to compute locally or offload tasks; MSL represents the strategy that all MEC servers execute tasks locally; MSC represents the strategy that all MEC servers offload tasks to the central cloud; and MSR represents the strategy that all MEC servers randomly select tasks to compute locally, offload tasks to a neighboring MEC server, or offload tasks to the central cloud.
In addition, the two state-of-the-art computing task offloading strategies are the single MD agent (SMDA) method [27] and the one MEC server agent (OMSA) method [26]. SMDA designs a DDQL-based agent deployed on MDs to make offload decisions based on the task information generated in the current time slot, the task queue information of the MD and its connected MEC server, and the network conditions of both. SMDA adopts the collaborative training of FL and blockchain to learn the optimal offloading strategy, and its goal is consistent with that of our strategy, that is, to minimize the system latency. OMSA designs a DDQN-based agent deployed on the MEC server, which finds the offloading decision through DRL training to solve the MINLP optimization problem based on the task information, the task queue information of MDs and all MEC servers, and the network status of MDs and their connected MEC servers. The optimization objective of OMSA is to minimize the energy consumption under delay constraints.

Simulation Setting
We simulated a collaborative cloud-side end MEC system, which consisted of multiple BSs and multiple MDs. Figure 4 shows the topology diagram of MEC servers. In the simulation system, the number of MEC servers was set to six. The shaded circles indicate that there are MDs within the range of the BSs, and the blank circles indicate that there are no MDs within the range of the BSs. The number of MDs in each BS was set to five. In the training process of agents, the maximum number of episodes and the maximum number of steps per episode were both set to 200.
Table 3 shows the model parameters and their simulation values. We used the parameters of Huawei's 5G AAU5619 wireless product (Huawei Technologies Co., Ltd., Shenzhen, China) to simulate the BS transmitter and built a wireless communication model between MDs and BSs on this basis. In the MEC system, the wired communication model included the wired communication between BSs and between BSs and the central cloud. The computing model included the task model of MDs and the CPU frequency of MDs and MEC servers. Both the mobile device agent and the MEC server agent used the Adam optimizer and the mean square error loss function. The current network and target network structures of mobile device agents were four-layer fully connected neural networks, in which the numbers of hidden layer neurons were 128 and 64, respectively. The current network and target network structures of the MEC server agent were a three-layer fully connected neural network and a one-layer two-branch network, in which the number of hidden layer neurons was 128.
We designed a simple movement model to simulate the movement of MDs, which was used to generate the distance between BSs and MDs. The MD initially moved randomly in a clockwise or counterclockwise direction along the connection line between the MEC servers in Figure 4, and there was a probability of 0.00001 that the MD would turn around and move in the opposite direction during the movement. The moving speed of the MD obeyed a uniform distribution of U(29.6,30.4) m/s, which is the general walking speed of people.
The simulation experiments were performed on a server equipped with two Intel(R) Xeon(R) Gold 6248 processors at 2.50 GHz (Intel Corporation, Santa Clara, CA, USA). All experimental code was implemented in Python, version 3.6.8, with the PyTorch library, version 1.8.2.

Performance of DRT
We used the reward values to evaluate the performance of the different computation offloading strategies. Because the reward moved in exactly the opposite direction to the task execution delay, a larger reward corresponded to a smaller task execution delay. Figure 5 compares the rewards of the DRT, MDL, MDO, and MDR strategies during training. The ordinate represents the cumulative reward per episode, averaged over all MDs. As the number of training episodes increased, the rewards of all strategies tended to converge. DRT achieved the highest reward among the compared strategies, which meant that DRT could learn the best offloading decision with minimal latency for MDs. In the first 50 episodes, the rewards of MDO and MDR increased gradually. This was because both MDO and MDR offloaded tasks to the MEC server for further decision making by the MEC server agent, i.e., the increasing trends reflected the training process of the D3QN-based MEC server agent. MDO obtained a reward close to that of DRT. This was because the wireless network was stable most of the time, so offloading tasks to the more powerful MEC server was the more effective way for the mobile device agent to reduce the execution delay of all tasks. Due to the limited computing power of MDs, local computing had a limited effect on reducing the overall task execution delay. For the same reason, MDL received the lowest reward and had the longest task execution delay.
Figure 6 compares the rewards of the DRT, MSL, MSC, and MSR strategies. DRT again achieved the highest reward, which meant that DRT could fully utilize the computing resources of the MEC servers and the central cloud to make the best offloading decision. In the first 25 episodes, the rewards of MSL, MSC, and MSR increased gradually. The reason was that the DDQN-based mobile device agent was still learning the offloading strategy, so this increase also reflected the convergence behavior of the MD agent. The reward obtained by MSL was close to that of DRT. Although DRT also considered the computing resources of neighboring MEC servers when offloading tasks, these resources were also used by other MEC servers, so DRT improved only slightly over MSL. Compared with MSC and MSR, however, DRT showed significant performance improvements.
Table 4 compares the average task delay of DRT with those of the baseline strategies. The results of the single episode were selected from the multiple episodes before convergence, and the results of multiple episodes were calculated from the convergent episode. DRT reduced the average task execution delay by 41.2%, 5.9%, 17.2%, 2.8%, 31.7%, and 24.2% compared with MDL, MDO, MDR, MSL, MSC, and MSR, respectively. This was because DRT adopted two types of agents to jointly learn the optimal offloading strategy, which allowed it to evaluate the resources of the entire MEC system more comprehensively and thus minimize the system latency.
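The percentages reported for Table 4 read as relative reductions in average delay with respect to each baseline. A minimal Python sketch of that calculation follows; the function name and the delay values are hypothetical placeholders chosen for illustration, not figures from the paper.

```python
def relative_reduction(baseline_delay: float, drt_delay: float) -> float:
    """Return the percentage by which drt_delay undercuts baseline_delay."""
    if baseline_delay <= 0:
        raise ValueError("baseline_delay must be positive")
    return 100.0 * (baseline_delay - drt_delay) / baseline_delay


# Hypothetical delays in seconds, chosen only to illustrate the formula.
print(round(relative_reduction(2.0, 1.5), 1))  # 25.0
```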
Figure 7 compares the rewards of DRT, SMDA, and OMSA during training. As the number of training episodes increased, both SMDA and OMSA tended to converge, and DRT obtained the highest reward. Compared with SMDA and OMSA, DRT could fully utilize the computing resources of MDs, MEC servers, and the central cloud. OMSA fluctuated heavily because it filled the experience replay pool in the first episode and the agent had already started to learn and update the network parameters within that episode, which caused overfitting. Therefore, when OMSA encountered a large number of states that differed from previous ones, the learned strategies did not necessarily perform well. Zooming in on episodes 0-100 shows that SMDA was close to DRT, but DRT converged faster than both SMDA and OMSA. SMDA performed close to DRT because it applied FL to achieve decentralized training, which effectively reduced the communication overhead. DRT, however, divided the computation offloading decision problem into two sub-decision problems, each solved by one of the two types of agents. As a result, the state and action spaces of each DRT agent were much smaller than those of SMDA and OMSA. This not only enabled the agents to learn the optimal strategy faster but also reduced unnecessary data transmission between MDs and MEC servers. Since SMDA and OMSA used a single type of agent, they needed to obtain MEC server information or mobile device information, which increased the burden on the wireless network and led to a higher decision delay.
Table 5 compares the average task delay of DRT with those of SMDA and OMSA. DRT reduced the average task execution delay by 8.0% and 50.3% compared with SMDA and OMSA, respectively. When the FDRT optimization method was applied to DRT, the task execution delay was further reduced; these results are presented in Section 6.4.
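The split of the offloading decision into two sub-decisions can be sketched as follows. This is only an illustration under stated assumptions: the policy callables stand in for the trained DDQN and D3QN networks, and the names `two_stage_decision`, `LOCAL`, `OFFLOAD`, and the target labels are hypothetical.

```python
from typing import Callable, Sequence

LOCAL, OFFLOAD = 0, 1  # binary action of the mobile device (MD) agent


def two_stage_decision(
    md_policy: Callable[[Sequence[float]], int],
    mec_policy: Callable[[Sequence[float]], str],
    md_state: Sequence[float],
    mec_state: Sequence[float],
) -> str:
    """Return where a task runs under a two-agent, two-stage decision."""
    # Stage 1: the MD agent sees only its own state (task, resources, network).
    if md_policy(md_state) == LOCAL:
        return "local"
    # Stage 2: the MEC server agent sees only server-side state and picks a target.
    return mec_policy(mec_state)


# Stand-in policies (assumptions): offload when the task is large, and
# send it to the cloud when the server is heavily loaded.
md_policy = lambda s: OFFLOAD if s[0] > 1.0 else LOCAL
mec_policy = lambda s: "cloud" if s[0] > 0.8 else "mec_server"

print(two_stage_decision(md_policy, mec_policy, [0.5], [0.9]))  # local
print(two_stage_decision(md_policy, mec_policy, [2.0], [0.9]))  # cloud
print(two_stage_decision(md_policy, mec_policy, [2.0], [0.2]))  # mec_server
```

Because each policy receives only its own state vector, neither agent needs the other side's full state to act, which is the property the text credits for the smaller state and action spaces.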

Performance of FDRT
Since FDRT combined FL with DRT to optimize the training of the multi-agent model, we compared FDRT with DRT and TFL. DRT used the original DRL method to train the model, and TFL [11] used a traditional federated learning method to train the mobile device agent. Figure 8 compares the losses of FDRT, DRT, and TFL during training. The ordinate denotes the cumulative loss per episode, averaged over all mobile device agents. The loss values stayed at 0 at first and then rose sharply. This was because the experience replay pools of the mobile device agents were not yet full, so the agents had not started learning. In addition, FDRT and TFL accelerated the convergence of the mobile device agents, and the losses of all three methods eventually stabilized. The zoomed-in subfigure shows that DRT converged at about 60 episodes, while FDRT converged at about 23 episodes, saving (60 - 23)/60 ≈ 61.7% of the training time for the mobile device agents.
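The flat initial loss follows from gating training on a full experience replay pool. The sketch below illustrates that gate; `ReplayPool`, `run_episode`, and the constant-loss `dummy_train_step` are hypothetical stand-ins, not the paper's implementation.

```python
import random
from collections import deque


class ReplayPool:
    """Fixed-capacity experience replay pool."""

    def __init__(self, capacity: int):
        self.buffer = deque(maxlen=capacity)

    def push(self, transition) -> None:
        self.buffer.append(transition)  # oldest entry is dropped when full

    def is_full(self) -> bool:
        return len(self.buffer) == self.buffer.maxlen

    def sample(self, batch_size: int) -> list:
        return random.sample(list(self.buffer), batch_size)


def run_episode(pool, transitions, train_step, batch_size):
    """Store transitions; train (and accumulate loss) only once the pool is full."""
    total_loss = 0.0
    for transition in transitions:
        pool.push(transition)
        if pool.is_full():
            total_loss += train_step(pool.sample(batch_size))
    return total_loss


pool = ReplayPool(capacity=4)
dummy_train_step = lambda batch: 1.0  # placeholder for a real gradient update
print(run_episode(pool, ["t1", "t2", "t3"], dummy_train_step, 2))  # 0.0
print(run_episode(pool, ["t4", "t5", "t6"], dummy_train_step, 2))  # 3.0
```

In the first episode the pool holds only three of its four slots, so no update runs and the reported loss is 0; the second episode fills the pool on its first transition and trains on every step after that.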

Figure 9 compares the average task execution delay of FDRT, DRT, and TFL. The strategies learned by FDRT and TFL were basically the same as that learned by DRT: FDRT sped up the training convergence of the mobile device agents while still maintaining the optimal learned strategy. In addition, FDRT aggregated fewer network parameters of mobile device agents in each MEC server and reduced the network transmission of parameters compared with DRT and TFL. Compared with DRT, FDRT reduced the average task execution delay by 2.8%, indicating that the strategy learned by FDRT was slightly better. This was because the semi-global parameter aggregation mechanism of FDRT enabled each mobile device agent to learn more state information, so that it was more likely to take the optimal action when the state of the MD changed.
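To illustrate aggregation performed at a MEC server rather than at a single global coordinator, the sketch below averages the parameters of only the mobile device agents attached to one server. The rule used here is plain sample-weighted averaging (FedAvg-style), assumed as a stand-in; it is not the paper's semi-global aggregation rule, and all names are hypothetical.

```python
from typing import Dict, List

Params = Dict[str, List[float]]  # layer name -> flattened weights


def aggregate_at_server(agent_params: List[Params], weights: List[float]) -> Params:
    """Weighted average of the parameters of agents attached to one MEC server."""
    if not agent_params or len(agent_params) != len(weights):
        raise ValueError("need one weight per agent and at least one agent")
    total = sum(weights)
    aggregated: Params = {}
    for name, first in agent_params[0].items():
        aggregated[name] = [
            sum(w * p[name][i] for p, w in zip(agent_params, weights)) / total
            for i in range(len(first))
        ]
    return aggregated


# Two hypothetical MD agents attached to the same server.
agent_a = {"fc1": [1.0, 3.0]}
agent_b = {"fc1": [3.0, 5.0]}
print(aggregate_at_server([agent_a, agent_b], [1.0, 1.0]))  # {'fc1': [2.0, 4.0]}
print(aggregate_at_server([agent_a, agent_b], [1.0, 3.0]))  # {'fc1': [2.5, 4.5]}
```

Only model parameters are exchanged in this scheme; the agents' raw experience stays on the devices, which is the privacy property that FL relies on.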

Discussion
The superior performance of this work over existing methods mainly comes from the following aspects. Compared with existing DRL-based methods with a single type of agent, DRT used two types of agents, one for MDs and one for MEC servers, to collaboratively learn the offloading strategies. Hence, DRT could find the optimal decision for minimizing the system latency by comprehensively considering all MEC server resources and network conditions in the MEC environment. The DDQN agent and the D3QN agent also took the application characteristics of MDs and MEC servers into account to reduce the high dimensionality of the MDP. In addition, FDRT applied FL to DRT to accelerate the multi-agent collaborative training while effectively protecting user data privacy. The decentralized FL training method distributed the model aggregation among MEC servers to relieve the computing pressure on resource-constrained MDs. The semi-global aggregation mechanism reduced the parameter transmission and network overhead, which further shortened the decision delay. These optimizations were very effective in helping latency-sensitive and computationally intensive applications quickly learn the computation offloading strategy in the time-varying MEC environment. Therefore, this work is well suited to real-world applications with strict task execution delay requirements, such as autonomous driving, online interactive gaming, virtual reality applications, and video streaming analysis.
Since this work focuses on system latency, one of its main limitations is that it may not fully meet the needs of application scenarios with strict energy consumption requirements. Incorporating energy consumption into the current model requires accounting for the heterogeneity of different resources when formulating the multi-objective optimization problem. Realizing a balanced optimal offloading strategy across objectives such as energy consumption, task delay, and system utilization needs further study. This will be one of our future work directions. In addition, this work is a binary computation offloading method. For many large-scale applications, a partial computation offloading strategy is more suitable. We will also study partial offloading in the future.

Conclusions
In this paper, we proposed a computation offloading method based on DRL and FL with the goal of minimizing the task execution delay in multi-base-station, multi-mobile-device MEC networks with cloud-side end collaboration. First, we proposed a multi-agent computation offloading strategy in which a DDQN MD agent and a D3QN MEC server agent collaboratively make decisions based on their own task information, resources, and time-varying network conditions. Second, we proposed a new FL-based parameter aggregation model that greatly reduced communication overhead and improved the training efficiency of the multi-agent DRL model while avoiding the disclosure of private user data. Extensive simulation experiments validated the advantages of the proposed method in task execution delay and training efficiency over several baseline and state-of-the-art computation offloading methods. In future work, we will try to extend the scope of application of this work by jointly considering the energy consumption model to maximize the system utility under energy and latency constraints. For large-scale real-world applications, we also plan to combine DRL and FL to solve the partial computation offloading problem.

Institutional Review Board Statement: Not applicable.
Informed Consent Statement: Not applicable.

Data Availability Statement: The data presented in this study are available in Tables 3-5 and Figures 5-9.